Recap last lecture

  • introducing Python 🐍

  • learning programming concepts & syntax

    • data types, loops, indexing, functions…
  • working with VS Code Editor

Outline

  • learn about available data sources 🗞️
  • analyze and justify your own data
    • convert .pdf into .txt 🔄
    • count words in Python 🧮

What data sources are there?

  • broadly social
    • newspapers + magazines
    • websites + social media
    • reports by NGOs/GOs
  • scientific articles
  • economic
    • business plans/reports
    • contracts
    • patents

👉 basically, any textual documents…

What does the data look like?

Any text is data, yet some formats are more suitable

  1. datasets like .csv or .tsv 🥰
  2. plain text like .txt 🙂
  3. text in other formats like .pdf 😬
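Tabular datasets (.csv/.tsv) are the easiest to work with because Python can read them directly. A minimal sketch with the built-in csv module — the file name, columns, and sample row are made up for illustration, not real course data:

```python
import csv

# hypothetical two-column sample, for illustration only
sample = "date\theadline\n2024-01-02\tExample headline\n"
with open("articles.tsv", "w", encoding="utf-8") as f:
    f.write(sample)

# read the .tsv back: each row becomes a dict keyed by the header line
rows = []
with open("articles.tsv", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        rows.append(row)
```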

Some great (historical) datasets

Datasets in .csv ready off-the-shelf


😓 There are still not many.

Dedicated search engines for datasets

Use case: Search for existing datasets


👉 search for a topic followed by corpus, text collection or text as data

Search techniques 🔍

Make your (Google) web search more efficient by using dedicated operators. Examples:

  • "computational social science"
  • site:nytimes.com
  • nature OR environment

Swissdox: A game changer

Assemble a news dataset and download as .tsv

  • over 250 Swiss newspapers
  • historical and updated daily
  • needs registration (free)

More publishers

👉 check out other resources licensed by ZHB

Interesting sources as PDFs

Any organization of your interest 👍

Scraping PDFs from websites

Use case: Swiss voting booklets

  • wget to download any files from the internet
# get a single file
wget EXACT_URL

# get all linked PDFs from a single webpage
wget --recursive --accept pdf -nH --cut-dirs=5 \
--ignore-case --wait 1 --level 1 --directory-prefix=data \
https://www.bk.admin.ch/bk/de/home/dokumentation/abstimmungsbuechlein.html

# --accept FORMAT_OF_YOUR_INTEREST
# --directory-prefix YOUR_OUTPUT_DIRECTORY
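If you prefer staying in Python, the single-file case of wget can be sketched with the standard library. The function name and paths below are my own placeholders, not part of the course material:

```python
import urllib.request

def download(url, out_path):
    """Fetch a single URL and save it to out_path (like `wget EXACT_URL`)."""
    with urllib.request.urlopen(url) as response, open(out_path, "wb") as f:
        f.write(response.read())
```

For recursive crawling of linked PDFs, wget (as above) remains the simpler tool.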

Data is no free lunch 😥

Data is property 🚫

… and has rights too

  • copyrights may further limit access to high-quality data
  • check the rights before processing data

Imperfect data: A tale of bias

  • noise in text

    • non-content (e.g., tables of contents), inconsistent spelling
  • archive holes

    • lost or uncollected data
  • selective corpus curation

    • assumption that keyword(s) capture the topic
  • social bias

    • view from somewhere, stereotypes


👉 think about the data and mitigate issues

PDF: Digitized or digital?

Two flavors of .pdf documents

Digitized PDF: made from a scanned page

Native PDF: converted from a digital document (e.g., .docx)

Optical Character Recognition (OCR)

  • OCR ~ convert images/scans into text
    • may support handwriting + Fraktur texts
  • conversion steps
    • convert to b/w image
    • run OCR model
    • correct spelling issues

Steps when performing OCR

Common conversions

Your text is of a particular type

⬇️

digital native documents
.pdf, .docx, .html

⬇️

extract text without formatting

⬇️

scans of (old) documents
.pdf, .jpg, .png

⬇️

Optical Character Recognition (OCR)


machine-readable formats: .txt or .csv
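The "extract text without formatting" step depends on the input format: for .pdf one typically reaches for a third-party library (e.g., pypdf, shown later in class), while for .html Python's standard library is already enough. A minimal sketch — the class name is my own, not an established API:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML page, dropping tags and markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # called for text between tags; skip whitespace-only fragments
        if data.strip():
            self.chunks.append(data.strip())

    def get_text(self):
        return " ".join(self.chunks)

extractor = TextExtractor()
extractor.feed("<html><body><h1>Title</h1><p>Some text.</p></body></html>")
print(extractor.get_text())  # Title Some text.
```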

What data are you interested in?

Think about your mini-project

  • to be presented in the last lecture
  • analyze any collection of text documents
    • compare historically
    • compare between actors
  • form groups of 2-3 people
  • requirements
    • apply quantitative measures on multiple documents
    • interpret and present results in class
    • share executable script

Illustration of text analysis generated by Image Creator from Microsoft Copilot

Assignment #2 ✍️

  • get/submit via OLAT
    • starting tomorrow
    • deadline: 19 April 2025, 23:59
  • discuss issues on OLAT forum

Let’s start coding

In-class: Exercises I

  1. Make sure that your local copy of the Github repository KED2025 is up-to-date with git pull.
  2. Use wget to download cogito and its predecessor uniluAKTUELL issues (PDF files) from the UniLu website. Start by downloading one issue first and then try to automate the process to download all the listed issues using arguments for the wget command.
  3. Convert the cogito and uniluAKTUELL PDF files into TXT files. You can use the code given for native, digital PDFs (no OCR needed). Try with a single issue first and then write a loop to batch process all of them.
  4. What is the University of Lucerne talking about in its issues? Use the commands of the previous lectures to count the vocabulary.
  5. Do the same as in 4.), yet analyze the vocabulary of cogito and uniluAKTUELL issues separately. Do the language and topics differ between the two magazines?
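Exercises 4 and 5 boil down to word frequencies. A minimal Python sketch using collections.Counter — the sample strings are placeholders, not real magazine text:

```python
import re
from collections import Counter

def count_words(text):
    """Lower-case the text and count word occurrences (German letters included)."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    return Counter(words)

# placeholder snippets standing in for the converted .txt files
cogito = "Forschung und Lehre an der Uni"
unilu_aktuell = "Veranstaltungen an der Uni Luzern"

# compare the most frequent words per magazine (exercise 5)
for name, text in [("cogito", cogito), ("uniluAKTUELL", unilu_aktuell)]:
    print(name, count_words(text).most_common(3))
```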